Abstract

We propose a non-parametric method to cluster mixed data containing both
continuous and discrete random variables. The product space of continuous and
categorical sample spaces is approximated locally by analyzing neighborhoods
with cluster patterns. Cluster patterns on the product space are detected
with a modified Chi-square test. The proposed method does not
impose a global distance function, which could be difficult to specify in practice.
Simulation studies show that the proposed method outperforms
the benchmark method, AutoClass, in various settings.

Keywords: AutoClass algorithm, clustering, mixed data, modified chi-square 
test. 

1 Introduction 

Mixed data are abundant in scientific research, especially in medical and biological
studies. An effective clustering method for mixed data partitions a large and complex
data set into manageable and homogeneous subgroups. It thus has a wide range of
applications in almost any scientific field, including financial data analysis,
personalized medicine and studies of climate change.

Most of the clustering methods in the literature focus mainly on
numerical data. The K-means algorithm has been widely used in industry for a long
time; a detailed description and discussion can be found in Kaufman and Rousseeuw
(2005). To capture the intrinsic geometric properties, a suitable distance function
such as the Manhattan distance or the Mahalanobis distance can be used when the
underlying sample space is believed to be non-Euclidean. The K-modes algorithm of
Huang (1997) extends this geometrical approach to categorical data. However, this
approach has proven not to be very successful for categorical data, as demonstrated
in Zhang et al. (2005). The geometrical or topological natures of continuous and
categorical sample spaces are intrinsically different, since the first can be endowed
with the structure of a differentiable manifold while the second is defined entirely
on a lattice with discontinuous functions. Even when suitable distance functions are
available for the continuous and discrete portions, a challenging question is how to
combine the metrics from a continuous and a discrete sample space. A naive approach
is to consider a convex combination of the two metrics, which implies that the
product space of continuous and discrete data can be metrized in this fashion. The
major difficulty lies in choosing the weights without introducing significant local
or global distortions.

Alternatively, a parametric model based on Gaussian mixtures could be used for
continuous data; see Banfield and Raftery (1993). One of the most prominent methods
is that of Bradley et al. (1998), which can be scaled to large disk-resident data sets.
The number of clusters and outliers can be handled simultaneously. Fraley and
Raftery (1998) propose to choose the number of clusters automatically for
model-based clustering. For clustering mixed data, the AutoClass method proposed
by Cheeseman and Stutz (1995) is well known and can be considered the benchmark
among model-based clustering methods in this class. AutoClass takes a
database containing both real and discrete valued attributes and automatically finds
the number of clusters and the group memberships. This method has been widely used
at NASA, where it helped to find infra-red stars in the IRAS Low Resolution Spectral
catalogue and to discover classes of proteins.

Instead of a parametric model, we propose a non-parametric clustering method
which does not assume a global distance function or any knowledge of the functional
form of the joint probability density function. The key idea is inspired by the fact
that a complicated manifold can be approximated locally by a manifold with simpler
structure. For example, it is well known that a differentiable manifold is locally
homeomorphic to $R^m$. For categorical data, we suppose that a neighborhood on a
lattice can be sufficiently characterized by the Hamming distance. The Hamming
distance is widely used in information and coding theory; see Roman (1992) and
Laboulias et al. (2002). It only measures how many attributes are different, without
any attempt to impose an order on the magnitude of the observed difference. When the
true manifold can be approximated locally by the product space of two manifolds that
adopt either the Euclidean or the Hamming distance, a statistical test can be designed
based on a weighted local Chi-squared test. This idea of a local test for clustering
was first proposed by Zhang et al. (2005) for categorical data.

This article is organized as follows. The method is proposed in Section 2. The
clustering algorithm is presented in Section 3. Simulation results are provided in
Section 4. Discussions are provided in Section 5.

2 Method 

In this section, we introduce the mixed sample space, on which we adopt the Hamming
distance and the Euclidean distance to measure the relative positions of two data
points. We define the HD vector, the ED vector and the optimal separation point, which
are essential components of the proposed weighted local chi-square test for clustering.

2.1 Joint Sample Space of Mixed Data 

Now consider a general setup for mixed data where $p$ nominal categorical attributes
and $q$ continuous attributes are of interest. The $j$th categorical attribute is
categorized by $m_j$ levels defined by the set $A_j = \{a_{j1}, \cdots, a_{jm_j}\}$,
$j = 1, \cdots, p$. The categorical portion of the data, $X = (X_1, \cdots, X_n)$, is
collected from $n$ subjects, with $X_i = (x_i^{[1]}, \cdots, x_i^{[p]})'$
being the vector of the observed states of the $p$ attributes for subject $i$. The
categorical sample space $\Omega_p$ is defined as the collection of all possible
$p$-dimensional vectors of states, namely
$\Omega_p = \{(\omega_1, \cdots, \omega_p)' \mid \omega_1 \in A_1, \cdots, \omega_p \in A_p\}$.
The continuous data, $Z = (Z_1, \cdots, Z_n)$, is collected from the same $n$ subjects,
with $Z_i = (z_i^{[1]}, \cdots, z_i^{[q]})'$ being the vector of the observed values of
the $q$ attributes for subject $i$, where $z_i^{[j]} \in R$ for $i = 1, \cdots, n$ and
$j = 1, \cdots, q$. The continuous sample space is defined as $\Omega_q = R^q$.
The mixed data consist of $(X, Z)$ with overall space $\Omega = \Omega_p \otimes \Omega_q$.

2.2 Distance Vectors

We use the Hamming distance (HD) to measure the relative positions of two categorical
data points and the Euclidean distance (ED) to measure those of two continuous data
points.

To be more specific, for any two positions in the categorical sample space $\Omega_p$,
$X_h = (\omega_h^1, \cdots, \omega_h^p)'$ and $X_i = (\omega_i^1, \cdots, \omega_i^p)'$,
the Hamming distance (HD) between $X_h$ and $X_i$ on the $j$th attribute is

$$d_j(X_h, X_i) = \begin{cases} 0, & \omega_h^j = \omega_i^j, \\ 1, & \omega_h^j \neq \omega_i^j, \end{cases}$$

and the distance between the two positions is the sum of the componentwise distances,

$$HD(X_h, X_i) = \sum_{j=1}^{p} d_j(X_h, X_i).$$

For continuous data, the Euclidean distance (ED) between the two positions is defined
as

$$ED(Z_h, Z_i) = \sqrt{(z_h^{[1]} - z_i^{[1]})^2 + (z_h^{[2]} - z_i^{[2]})^2 + \cdots + (z_h^{[q]} - z_i^{[q]})^2}. \qquad (1)$$

We now introduce the HD vector and the ED vector. Let $(S, T)$ be a reference position in
the sample space with $S = (s_1, \cdots, s_p) \in \Omega_p$ and $T = (t_1, \cdots, t_q) \in R^q$.
We measure the distance of all data points to the selected reference point. For the
categorical portion, $HD(X_i, S)$ can take values ranging from 0 to $p$. We define the HD
vector to record the frequencies of each distance value. Namely, the HD vector $U(S)$ is a
$(p+1)$-element vector defined as $(U_0(S), U_1(S), \cdots, U_p(S))'$, where the $j$th
component is given by

$$U_j(S) = \sum_{i=1}^{n} I[HD(X_i, S) = j], \quad j = 0, \cdots, p,$$

where $I(A)$ is the indicator function that takes value 1 when event $A$ happens and
value 0 when event $A$ does not happen. For the continuous portion of the data, in order
to construct a frequency vector of $ED(Z_i, T)$, we need to choose the bins. The
choice of bins is left to the user. In practice, we find that choosing the number of
bins $l$ equal to 10 gives satisfactory empirical results. Let $B = (B_1, \cdots, B_l)$
denote a set of equal-sized intervals $B_j$, $j = 1, 2, \cdots, l$. An ED vector is
defined as $V(T) = [V_1(T), V_2(T), \cdots, V_l(T)]'$, where the $j$th component is the
frequency given by

$$V_j(T) = \sum_{i=1}^{n} I[ED(Z_i, T) \in B_j], \quad j = 1, \cdots, l.$$
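
As an illustration of the two frequency vectors, the following minimal Python sketch computes an HD vector and an ED vector for a small data set. The function names, the `n_bins` and `max_dist` parameters, and the choice to spread the equal-width bins over the interval from 0 to `max_dist` are assumptions made for this example; the text only fixes the number of bins.

```python
import math

def hamming(x, s):
    """Number of attributes on which two categorical vectors differ."""
    return sum(a != b for a, b in zip(x, s))

def euclidean(z, t):
    """Euclidean distance between two continuous vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, t)))

def hd_vector(X, s):
    """HD vector U(S): counts of points at Hamming distance 0..p from s."""
    u = [0] * (len(s) + 1)
    for x in X:
        u[hamming(x, s)] += 1
    return u

def ed_vector(Z, t, n_bins=10, max_dist=None):
    """ED vector V(T): counts of distances to t in n_bins equal-width bins.

    Assumption: the bins partition [0, max_dist]; by default max_dist is the
    largest observed distance. Larger distances fall into the last bin.
    """
    dists = [euclidean(z, t) for z in Z]
    if max_dist is None:
        max_dist = max(dists)
    width = max_dist / n_bins if max_dist > 0 else 1.0
    v = [0] * n_bins
    for d in dists:
        v[min(int(d / width), n_bins - 1)] += 1
    return v

X = [("a", "b", "c"), ("a", "b", "d"), ("x", "y", "z")]
Z = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
print(hd_vector(X, ("a", "b", "c")))
print(ed_vector(Z, (0.0, 0.0)))
```

Both vectors always sum to the number of data points, which is a convenient sanity check on any implementation.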

In order to use the HD vector and ED vector to detect possible clusters, we define a
reference or null HD vector and ED vector for the case where there is no clustering
pattern in the mixed sample space $\Omega_p \otimes \Omega_q$. If there is indeed no
pattern, then it is equally likely for a randomly chosen data point to take any possible
position in the joint sample space. The resulting HD vector is called the uniform HD
vector (UHD) and the resulting ED vector is called the uniform ED vector (UED); they
record the expected frequencies under the null hypothesis that there are no clustering
patterns in the data. Let $X$ be the categorical portion and $Z$ the continuous portion
of the data from a sample of size $n$, with each observation having an equal probability
of being located at any position in the space $\Omega_p \otimes \Omega_q$. The expected
values of the HD vector and ED vector under the null hypothesis, denoted by
$\varepsilon = (\varepsilon_0, \cdots, \varepsilon_p)'$ and $\nu = (\nu_1, \cdots, \nu_l)'$,
are the UHD vector and the UED vector, respectively.

Zhang et al. (2005) proved that the UHD takes the form $\varepsilon = \frac{n}{M} U^*$, where
$M = \prod_{j=1}^{p} m_j$ and $U^* = (U_0^*, U_1^*, \cdots, U_p^*)'$ with

$$U_0^* = 1;$$

$$U_1^* = (m_1 - 1) + (m_2 - 1) + \cdots + (m_p - 1);$$

$$U_2^* = \sum_{i<j} (m_i - 1)(m_j - 1);$$

$$\vdots$$

$$U_p^* = (m_1 - 1)(m_2 - 1) \cdots (m_p - 1).$$
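
The components $U_k^*$ are the elementary symmetric polynomials of the values $m_j - 1$, so they can be read off as the coefficients of the product of the factors $1 + (m_j - 1)x$. The sketch below uses this observation to compute the UHD vector; the function name `uhd_vector` and its argument names are hypothetical and chosen only for this example.

```python
def uhd_vector(levels, n):
    """UHD vector eps = (n / M) * U*, for attribute level counts m_1..m_p.

    U*_k is the coefficient of x^k in prod_j (1 + (m_j - 1) x), i.e. the
    number of positions at Hamming distance k from any fixed position.
    """
    coeffs = [1]  # polynomial coefficients, lowest degree first
    for m in levels:
        new = coeffs + [0]
        for k, c in enumerate(coeffs):
            new[k + 1] += c * (m - 1)  # multiply by (1 + (m - 1) x)
        coeffs = new
    M = 1
    for m in levels:
        M *= m  # total number of positions in the categorical sample space
    return [n * c / M for c in coeffs]

print(uhd_vector([2, 3], 12))
```

For two attributes with 2 and 3 levels and $n = 12$, there are $M = 6$ positions, of which 1, 3 and 2 lie at Hamming distance 0, 1 and 2 from any reference position, giving expected frequencies 2, 6 and 4.
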

For continuous data, the exact distribution of the UED vector is not tractable, so we
obtain this vector by computer simulation. We simulate random data points
with $q$ independent continuous attributes. The UED vector consists of the sampling
frequencies of the ED vector from these simulations, which correspond to the null
hypothesis that there is no more than one cluster. Figure 1 plots the UED vector
obtained under the simulated null hypothesis with no clusters and the ED vector
obtained from a simulated data set with clustering structure.
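
One way such a simulation might look is sketched below. The text does not specify the null distribution or the binning range, so this example makes two loudly stated assumptions: the attributes are drawn independently from the uniform distribution on $[0, 1]$ (as if the data had been rescaled to the unit hypercube), and the bins are equal-width intervals on the range from 0 to `max_dist`. The function name and all parameter names are hypothetical.

```python
import math
import random

def simulate_ued(t, n, max_dist, n_bins=10, n_sim=200, seed=0):
    """Monte Carlo estimate of the UED vector at reference point t.

    Assumptions (not from the source): null data are uniform on [0, 1]^q,
    and the bins are n_bins equal-width intervals on [0, max_dist].
    """
    rng = random.Random(seed)
    q = len(t)
    width = max_dist / n_bins
    totals = [0] * n_bins
    for _ in range(n_sim):          # repeat the null experiment n_sim times
        for _ in range(n):          # n points per simulated data set
            z = [rng.random() for _ in range(q)]
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(z, t)))
            totals[min(int(d / width), n_bins - 1)] += 1
    # Average bin counts over the simulated data sets.
    return [count / n_sim for count in totals]

nu = simulate_ued([0.5, 0.5, 0.5], n=50, max_dist=3 ** 0.5 / 2)
print(nu)
```

The averaged frequencies sum to $n$, so they can be compared bin by bin with an observed ED vector built on the same bins.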

2.3 Optimal Separation Point 

If the initial starting point is chosen to be the center of one particular cluster, then
the frequencies of ED should show a generally decreasing pattern, as the ED vector
records the frequencies of data points from the center of the cluster outwards. Small
local bumps at the beginning of the ED curve are expected if the initial starting
point deviates slightly from the cluster center. Any substantial reversal of the
decreasing trend produces a valley on the ED curve, as can be seen in Figure
2. This might indicate distances that correspond to boundary points of the current
cluster. The recorded frequencies might increase when the vector records distances
from another cluster. Therefore, the valley is an ideal place to separate the data
points of the current cluster from the rest.

Figure 1: Plots of the ED vector and the UED vector.

Assume that the categorical data $X$ and continuous data $Z$ are not uniformly
distributed in the sample space $\Omega_p \otimes \Omega_q$. Let
$U(S) = (U_0(S), U_1(S), \cdots, U_p(S))'$, $S \in \Omega_p$,
be the collection of all $(p+1)$-element HD vectors in the space $\Omega_p$ and
$V(T) = (V_1(T), V_2(T), \cdots, V_l(T))'$, $T \in \Omega_q$, be the collection of all
$l$-element ED vectors in the space $\Omega_q$, and let
$\varepsilon = (\varepsilon_0, \varepsilon_1, \cdots, \varepsilon_p)'$ be the UHD vector
and $\nu = (\nu_1, \nu_2, \cdots, \nu_l)'$ be the UED vector defined in the previous
subsection. For a given distance value $j_c$, $j_c = 0, 1, \cdots, p$, for categorical
distance values and $j_d$, $j_d = 1, 2, \cdots, l$, for continuous distance values, there
always exists at least one position $(S, T) \in \Omega_p \otimes \Omega_q$ such that the
frequency at this distance value is larger than the corresponding component,
$\varepsilon_{j_c}$ of the UHD vector $\varepsilon$ and $\nu_{j_d}$ of the UED vector $\nu$.
In order to compare the HD vector with the UHD vector, and the ED vector with the UED
vector, we introduce a selection criterion for an optimal separation or cut-off point
$r^*$. The categorical cut-off was defined and justified by Zhang et al. (2005). We
extend their approach to the continuous portion of the data. If cluster structure is
present, the early segment of an HD vector and ED vector with respect to a data center
should contain substantially larger frequencies than the corresponding frequencies of
the UHD vector and UED vector, respectively. When the observed distance vectors
intersect and drop below the UHD or UED vectors, valleys are created, and they provide
good hints about the locations of optimal separation points. This leads to the optimal
$r_c^*$ for the categorical portion of the data:

$$r_c^*(S) = \min_{j_c > 0} \left\{ j_c \;\middle|\; \frac{U_{j_c}(S)}{\varepsilon_{j_c}} < 1 \right\} - 1, \quad S \in \Omega_p.$$

Similarly, the optimal $r_d^*$ for the continuous portion of the data is:

$$r_d^*(T) = \min \left\{ j_d \;\middle|\; \frac{V_{j_d}(T)}{\nu_{j_d}} < 1 \right\}, \quad T \in \Omega_q.$$

The vertical line in Figure 2 marks the selected optimal separation point for continuous
data, where the two curves first intersect.
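
The two cut-off rules amount to scanning each observed vector for the first distance value at which it drops below its uniform counterpart. The sketch below is one possible reading of those rules; the function names are hypothetical, the comparison `obs < exp` replaces the ratio test (equivalent for positive expected frequencies), and returning `None` when the observed vector never drops below the reference is an assumption, since the text does not cover that case.

```python
def categorical_cutoff(u, eps):
    """r_c*: (first j > 0 with U_j < eps_j) minus 1; u and eps are indexed 0..p."""
    for j in range(1, len(u)):
        if u[j] < eps[j]:
            return j - 1
    return None  # assumption: no crossing found

def continuous_cutoff(v, nu):
    """r_d*: first bin index j (1-based) with V_j < nu_j."""
    for j, (obs, exp) in enumerate(zip(v, nu), start=1):
        if obs < exp:
            return j
    return None  # assumption: no crossing found

# Toy vectors: the observed counts exceed the uniform counts near the center.
print(categorical_cutoff([5, 9, 4, 1, 1], [1, 4, 6, 4, 5]))
print(continuous_cutoff([8, 6, 2, 1], [3, 5, 6, 3]))
```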

3 Algorithm 

There are two key steps in the method. First, we detect whether there exists any
statistically significant clustering pattern. We propose to use a weighted local
Chi-square test to determine whether the observed distance vectors differ significantly
from the uniform distance vectors associated with no cluster pattern. Second, if the
patterns are significant, we extract the clusters based on the optimal separation
strategies described in the previous section.

Figure 2: Determining the cut-off point $r^*$ by comparing the ED and UED vectors.

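We consider the null hypothesis $H_0$: there is no clustering pattern in the data set.
The weighted local Chi-squared statistic is defined as:

$$\chi_w^2(r^*; (S, T)) = \frac{\chi_c^2(r_c^*; S)}{p} + \frac{\chi_d^2(r_d^*; T)}{q}, \quad (S, T) \in \Omega,$$

where the categorical part $\chi_c^2(r_c^*; S)$ takes the form

$$\chi_c^2(r_c^*; S) = \sum_{j=0}^{r_c^*} \frac{(U_j(S) - \varepsilon_j)^2}{\varepsilon_j} + \frac{\left(\sum_{j=0}^{r_c^*} U_j(S) - \sum_{j=0}^{r_c^*} \varepsilon_j\right)^2}{n - \sum_{j=0}^{r_c^*} \varepsilon_j} \qquad (2)$$

and the continuous part $\chi_d^2(r_d^*; T)$ takes the form

$$\chi_d^2(r_d^*; T) = \sum_{j=1}^{r_d^*} \frac{(V_j(T) - \nu_j)^2}{\nu_j} + \frac{\left(\sum_{j=1}^{r_d^*} V_j(T) - \sum_{j=1}^{r_d^*} \nu_j\right)^2}{n - \sum_{j=1}^{r_d^*} \nu_j},$$

where $p$ and $q$ are the numbers of attributes in the categorical and continuous data,
respectively.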
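
A minimal numerical sketch of the weighted statistic follows. It assumes that each part has the usual Pearson form in which the cells up to the cut-off are compared one by one and all remaining cells are lumped into a single remainder cell with expected count $n$ minus the expected counts already used. The function names `partial_chi2` and `weighted_local_chi2` are hypothetical.

```python
def partial_chi2(obs, exp, n_cells, n):
    """Chi-square over the first n_cells cells plus one lumped remainder cell.

    Assumption: the remainder cell has expected count n - sum(exp[:n_cells]);
    it is skipped when that expected count is not positive.
    """
    head_obs, head_exp = obs[:n_cells], exp[:n_cells]
    stat = sum((o - e) ** 2 / e for o, e in zip(head_obs, head_exp))
    rest_exp = n - sum(head_exp)
    if rest_exp > 0:
        stat += (sum(head_obs) - sum(head_exp)) ** 2 / rest_exp
    return stat

def weighted_local_chi2(u, eps, r_c, v, nu, r_d, p, q):
    """Categorical part divided by p plus continuous part divided by q."""
    n = sum(u)                                  # every point is counted once in u
    chi_c = partial_chi2(u, eps, r_c + 1, n)    # HD values 0..r_c
    chi_d = partial_chi2(v, nu, r_d, n)         # bins 1..r_d
    return chi_c / p + chi_d / q

print(weighted_local_chi2([6, 2, 2], [2, 4, 4], 0, [6, 2, 2], [2, 4, 4], 1, p=2, q=5))
```

Dividing each part by its number of attributes keeps the block with more attributes from dominating the combined statistic.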

After the statistical test yields a significant result, we proceed to extract
clusters by determining the cluster center $C$ and estimating the cluster radius $R$ for
mixed data. The cluster center $C$ is chosen as the position where $\chi_w^2$ attains its
maximum value:

$$C = \arg\max_{(S, T)} \chi_w^2(r^*; (S, T)).$$

Determining the cluster size is the next key step in completing the cluster extraction
process. We use the radius to determine the size of a cluster. Zhang
et al. (2005) defined the radius as the maximum distance from the data
points in a cluster to its center, and took it to be the distance at which the HD vector
has its very first local minimum. The categorical radius $R_c(C)$ is therefore defined as:

$$R_c(C) = \min_j \{ j \mid U_j(C) < \min(U_{j-1}(C), U_{j+1}(C)) \} - 1.$$

For the continuous part of the data, only the values before the cut-off point are
relevant for selecting the radius. Figure 3 gives the empirical CDF plot of the ED
values, in which the ED values jump at certain points. The first jump point is used as
the value of the continuous radius. During each extraction iteration, we remove the
extracted data points from the rest of the clustering process before calculating the
distances between the remaining subjects and a fixed reference position.
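
The two radius rules can be sketched as follows. The categorical rule follows the first-local-minimum definition directly. For the continuous rule, the text does not say how a "jump" is detected, so this example assumes one particular criterion: the first gap between consecutive sorted distances that exceeds `gap_factor` times the median gap. That criterion, the `cutoff` and `gap_factor` parameters, the fallback return values, and the function names are all assumptions made for illustration.

```python
def categorical_radius(u):
    """(First j with U_j below both neighbours) minus 1; u is indexed 0..p."""
    for j in range(1, len(u) - 1):
        if u[j] < min(u[j - 1], u[j + 1]):
            return j - 1
    return len(u) - 1  # assumption: no interior local minimum, keep everything

def continuous_radius(dists, cutoff=None, gap_factor=3.0):
    """Distance just before the first large gap in the sorted distances.

    Assumption: a "jump" is a gap larger than gap_factor times the median gap.
    Only distances up to `cutoff` are considered when a cut-off is given.
    """
    d = sorted(x for x in dists if cutoff is None or x <= cutoff)
    if len(d) < 2:
        return d[0] if d else 0.0
    gaps = [b - a for a, b in zip(d, d[1:])]
    typical = sorted(gaps)[len(gaps) // 2]
    for left, gap in zip(d, gaps):
        if gap > 0 and gap > gap_factor * typical:
            return left
    return d[-1]  # assumption: no jump found, use the largest distance

print(categorical_radius([10, 6, 2, 5, 1]))
print(continuous_radius([0.1, 0.2, 0.3, 0.4, 2.0, 2.1, 2.2]))
```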



Figure 3: Determining the radius $R_d$ for the continuous portion of the data. Empirical
CDF plot of the observed ED values (horizontal axis: ED values).



The detailed procedure for our method is as follows:

Step 1. For each position $(S, T)$, calculate the Hamming distance (HD) in the categorical
sample space $\Omega_p$ and the Euclidean distance (ED) in the continuous sample space
$\Omega_q$;

Step 2. Based on HD and ED, calculate the HD vector and the ED vector and compare them
with the UHD vector and the UED vector;

Step 3. Determine the cut-offs $r_c^*(S)$ and $r_d^*(T)$ for categorical and continuous
data respectively; then calculate the corresponding modified Chi-squared statistics
$\chi_c^2(r_c^*; S)$ and $\chi_d^2(r_d^*; T)$ and obtain the test statistic, the weighted
local Chi-square $\chi_w^2(r^*; (S, T))$;

Step 4. For the weighted local Chi-squared test, select the largest test
statistic and compare it with the critical value $\chi^2_{(0.05)}$. If $\max \chi_w^2$ is
smaller than $\chi^2_{(0.05)}$, stop the algorithm; otherwise, continue to Step 5;

Step 5. Assign the position that has the largest test statistic $\chi_w^2$ as the cluster
center $C$;

Step 6. Calculate the categorical radius $R_c(C)$ and the continuous radius $R_d(C)$; label
all data points within the radius as members of the cluster; remove them from the current
data set;

Step 7. Repeat Steps 1 to 6 until no more significant clusters are detected.
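
The control flow of the steps above, an extract-and-remove loop with a significance stopping rule, can be sketched as follows. This is only a skeleton: `score` and `members_of` are hypothetical callables standing in for the weighted local Chi-squared statistic and the radius rule, and restricting the candidate positions to the observed data points is an assumption made to keep the search finite.

```python
def extract_clusters(points, score, members_of, critical_value):
    """Repeatedly pick the best-scoring center, label its members, remove them.

    score(center, current)           -> test statistic at a candidate center
    members_of(center, pt, current)  -> True if pt lies within the cluster radius
    Returns a dict mapping point index to cluster id; unlabeled points are left out.
    """
    remaining = list(range(len(points)))
    labels = {}
    cluster_id = 0
    while remaining:
        current = [points[i] for i in remaining]
        # Steps 1-4: evaluate the statistic at every candidate position.
        stats = [score(c, current) for c in current]
        best = max(range(len(current)), key=stats.__getitem__)
        if stats[best] < critical_value:
            break  # no significant clustering pattern is left
        # Steps 5-6: the best position becomes the center; label its members.
        center = current[best]
        inside = [i for i, pt in zip(remaining, current)
                  if members_of(center, pt, current)]
        if not inside:
            break
        for i in inside:
            labels[i] = cluster_id
        removed = set(inside)
        remaining = [i for i in remaining if i not in removed]  # Step 7: iterate
        cluster_id += 1
    return labels

# Toy one-dimensional usage with stand-in scoring and membership rules.
pts = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
near = lambda c, pt, cur: abs(pt - c) <= 0.5
count_near = lambda c, cur: sum(abs(pt - c) <= 0.5 for pt in cur)
print(extract_clusters(pts, count_near, near, critical_value=2))
```

In the toy run the isolated point at 9.0 is left unlabeled because its stand-in statistic falls below the critical value.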

4 Numerical Results 

We carry out simulation studies to examine the performance of our proposed method.
Classification rates and information gains are calculated to compare the performance
of our proposed method with that of the AutoClass algorithm. For simplicity, we assume
that all attributes in the mixed data are independent. The simulation settings are as
follows:

1. Set the number of categorical attributes $p = 10$, where each attribute takes $m_j$
levels, with $m_j$ randomly selected from the set {4, 5, 6}; set the number of
continuous attributes $q = 10$.

2. Set the number of clusters $K_c = K_d = 3$ or 5 with various sizes. The number of
replications is 500. Continuous cases 1, 2 and 3 are generated from 10 independent
normal distributions with the same mean but with variances 0.25, 0.5
and 1, respectively.

3. For categorical data, in the $k$th cluster with center $C_k$, generate 10-attribute
vectors independently. More specifically, generate each attribute from a
multinomial distribution in which the center level has probability 0.7 and each of the
remaining levels has probability $0.3/(m_j - 1)$. For continuous data, the $n_k$
10-attribute vectors consist of 10 independent normal random variables with
$\mu = C_k$ and $\sigma^2$ equal to 0.25, 0.5 and 1, respectively.
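
A data generator for one cluster under these settings might look like the sketch below. Coding the levels of attribute $j$ as the integers $0, \cdots, m_j - 1$, the way the cluster centers are supplied, and the function and parameter names are assumptions made for this example; the text does not say how the centers are chosen.

```python
import random

def simulate_cluster(n_k, cat_center, levels, cont_center, variance, rng,
                     p_center=0.7):
    """Generate n_k mixed observations around one cluster center.

    Each categorical attribute equals its center level with probability
    p_center; otherwise one of the other m_j - 1 levels is drawn uniformly,
    so each has probability (1 - p_center) / (m_j - 1). Each continuous
    attribute is normal with mean at the center and the given variance.
    """
    sd = variance ** 0.5
    rows = []
    for _ in range(n_k):
        cat = []
        for c, m in zip(cat_center, levels):
            if rng.random() < p_center:
                cat.append(c)
            else:
                cat.append(rng.choice([lv for lv in range(m) if lv != c]))
        cont = [rng.gauss(mu, sd) for mu in cont_center]
        rows.append((tuple(cat), tuple(cont)))
    return rows

rng = random.Random(1)
levels = [rng.choice([4, 5, 6]) for _ in range(10)]   # m_j for p = 10 attributes
data = simulate_cluster(100, [0] * 10, levels, [0.0] * 10, 0.25, rng)
print(data[0])
```

Calling the function once per cluster, with different centers and sizes, and concatenating the results gives a full simulated data set.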

Tables 1 to 4 provide results from the simulation experiments with 500 replications
under various settings of sample size, number of clusters and cluster sizes.
Averages of classification rates (CR) and information gains (IG), with their
corresponding standard deviations, are used to evaluate the performance of the two
methods.

Table 1 is obtained from simulated data with a sample size of 200 and 3
clusters of sizes 100, 75 and 25, respectively. Table 2 is obtained from simulated
data with a sample size of 200 and 3 clusters of sizes 130, 45 and 25,
respectively. The simulated data for Tables 3 and 4 have a sample size of 100 and
5 clusters, with cluster sizes 40, 25, 15, 10 and 10 for Table 3 and 35, 25, 20,
10 and 10 for Table 4.

It can be seen from Table 1 that our proposed algorithm has a relatively higher
classification rate and information gain with correspondingly lower standard
deviations, especially for the continuous portion of the data. Table 2 shows that,
compared to AutoClass, our method generates significantly higher CR with lower standard
deviations for both categorical and continuous data. For instance, for Categorical 1,
AutoClass gives 75% CR with a 14% standard deviation, while ours gives 96% CR with a 5%
standard deviation. Table 3 and Table 4 show the same patterns. In summary,
Tables 1 to 4 show that our proposed algorithm consistently has a higher classification
rate and information gain with correspondingly lower standard deviations.

5 Discussion 

We have proposed a non-parametric clustering method based on a weighted modified
Chi-square test. Numerical results show that the proposed method outperforms the
AutoClass algorithm in terms of classification rate and entropy measure for various
simulation settings. The proposed method is most useful when neither a distance
function nor a parametric model can be assumed. We plan to extend the proposed
method to cluster spatial and temporal data.






Table 1: Average Classification Rates (CR) and Information Gains (IG) with corresponding
standard deviations for each method. The sample size is 200 with 3 clusters. The clusters
have sizes 100, 75 and 25, respectively. The number of replications is 500.

          Categ1      Cont1       Categ2      Cont2       Categ3      Cont3
                      (Var=0.25)              (Var=0.5)               (Var=1)
AutoClass
CRMean    0.9694      0.8278      0.9705      0.8283      0.9737      0.8421
CRStd     0.0478      0.0542      0.0483      0.0545      0.0360      0.0611
IGMean    0.5508      1.0000      0.5787      1.0000      0.5582      1.0000
IGStd     0.2201      0.0000      0.2170      0.0000      0.2142      0.0000
Weighted local Chi-squared test
CRMean    0.9728      0.9872      0.9718      0.9849      0.9743      0.9870
CRStd     0.0380      0.0379      0.0458      0.0471      0.0418      0.0434
IGMean    0.8956      0.9683      0.8959      0.9646      0.9027      0.9706
IGStd     0.1128      0.0942      0.1174      0.1044      0.1099      0.0955

Table 2: Average CR and IG with corresponding standard deviations for each method. The
sample size is 200 with 3 clusters. The clusters have sizes 130, 45 and 25, respectively.
The number of replications is 500.

          Categ1      Cont1       Categ2      Cont2       Categ3      Cont3
                      (Var=0.25)              (Var=0.5)               (Var=1)
AutoClass
CRMean    0.7497      0.6919      0.7530      0.6852      0.7618      0.6924
CRStd     0.1375      0.0742      0.1407      0.0599      0.1414      0.0664
IGMean    0.6200      1.0000      0.6361      1.0000      0.6246      1.0000
IGStd     0.2398      0.0000      0.2248      0.0000      0.2408      0.0000
Weighted local Chi-squared test
CRMean    0.9633      0.9742      0.9680      0.9795      0.9648      0.9762
CRStd     0.0549      0.0588      0.0493      0.0531      0.0556      0.0590
IGMean    0.8633      0.9383      0.8764      0.9511      0.8694      0.9426
IGStd     0.1585      0.1432      0.1421      0.1287      0.1584      0.1475

Table 3: Average CR and IG with corresponding standard deviations for each method. The
sample size is 100 with 5 clusters. The clusters have sizes 40, 25, 15, 10 and 10,
respectively. The number of replications is 500.

          Categ1      Cont1       Categ2      Cont2       Categ3      Cont3
                      (Var=0.25)              (Var=0.5)               (Var=1)
AutoClass
CRMean    0.7667      0.7592      0.7686      0.7268      0.7619      0.7035
CRStd     0.0440      0.0403      0.0483      0.0425      0.0497      0.0371
IGMean    0.6423      0.8699      0.6422      0.8699      0.6390      0.8699
IGStd     0.1958      0.0000      0.1938      0.0000      0.2014      0.0000
Weighted local Chi-squared test
CRMean    0.8503      0.8745      0.8507      0.8747      0.8503      0.8741
CRStd     0.1002      0.1045      0.1011      0.1047      0.0998      0.1035
IGMean    0.7172      0.8432      0.7174      0.8460      0.7171      0.8439
IGStd     0.1666      0.1474      0.1695      0.1434      0.1645      0.1438

Table 4: Average CR and IG with corresponding standard deviations for each method. The
sample size is 100 with 5 clusters. The clusters have sizes 35, 25, 20, 10 and 10,
respectively. The number of replications is 500.

          Categ1      Cont1       Categ2      Cont2       Categ3      Cont3
                      (Var=0.25)              (Var=0.5)               (Var=1)
AutoClass
CRMean    0.7901      0.6751      0.7887      0.6532      0.7910      0.6321
CRStd     0.0515      0.0338      0.0505      0.0370      0.0510      0.0346
IGMean    0.6728      0.5419      0.6695      0.5519      0.6813      0.5570
IGStd     0.1958      0.0236      0.1923      0.0310      0.1846      0.0312
Weighted local Chi-squared test
CRMean    0.8634      0.8879      0.8581      0.8843      0.8699      0.8959
CRStd     0.1009      0.1019      0.1015      0.1013      0.0999      0.0984
IGMean    0.7454      0.8586      0.7352      0.8536      0.7565      0.8693
IGStd     0.1608      0.1384      0.1622      0.1373      0.1572      0.1286